Welcome
What you will learn
Skills
Data
What you won’t learn
Simple vs Easy
How to Contribute
1 Introduction to R and RStudio
1.1 Why learn to program?
1.1.1 Scale
1.1.2 Reproducibility
1.2 Using RStudio
1.2.1 Opening an R Script
1.2.2 Setting the working directory
1.2.3 Changing RStudio
1.2.3.1 General
1.2.3.2 Code
1.2.3.3 Appearance
1.2.3.4 Pane Layout
1.2.4 Helpful cheatsheets
1.3 Reading data into R
1.3.1 Loading data
1.4 First steps to exploring data
1.5 Finding help about functions
(PART) Clean
2 Subsetting: Making big things small
2.1 Select specific values
2.2 Assignment values to objects (Making “things”)
2.3 Vectors (collections of “things”)
2.4 Logical values and operations
2.4.1 Matching a single value
2.4.2 Matching multiple values
2.4.3 Does not match
2.4.4 Greater than or less than
2.4.5 Combining conditional statements - or, and
2.5 Subsetting a data.frame
2.5.1 Select specific columns
2.5.2 Select specific rows
2.5.3 Battleships
2.5.4 Subset Colorado data
3 Exploratory data analysis
3.1 Summary and Table
3.2 Graphing
3.3 Aggregating (summaries of groups)
4 Dates and times
4.1 Why do dates and times matter?
4.2 lubridate
4.3 Working with dates
4.4 Chicago crime data
4.4.1 Exercises
5 Regular Expressions
5.1 Finding patterns in text with grep()
5.2 Finding and replacing patterns in text with gsub()
5.3 Useful special characters
5.3.1 Multiple characters []
5.3.2 n-many of previous character {n}
5.3.3 n-many to m-many of previous character {n,m}
5.3.4 Start of string and “not” ^
5.3.5 End of string $
5.3.6 Anything .
5.3.7 One or more of previous +
5.3.8 Zero or more of previous *
5.3.9 Multiple patterns |
5.3.10 Parentheses ()
5.3.11 Optional text ?
5.4 Changing capitalization
(PART) Collect
6 Webscraping with rvest
6.1 Scraping one page
6.2 Cleaning the webscraped data
6.3 Fixing names
6.3.1 Exercises
7 Functions
7.1 A simple function
7.2 Adding parameters
7.3 Making a function to scrape movie data
8 For loops
8.1 Basic for loops
8.2 Scraping multiple days of movie data
9 Reading and Writing Data
9.1 Reading Data into R
9.1.1 R
9.1.2 Excel
9.1.3 Stata
9.1.4 SAS
9.1.5 SPSS
9.2 Writing Data
9.2.1 R
9.2.2 Excel
9.2.3 Stata
9.2.4 SAS
9.2.5 SPSS
10 Scraping data from PDFs
10.1 Downloading officer-involved Shooting Files
10.2 Scraping information from the page
10.2.1 Combining the data sets
10.3 Extracting data from PDFs
10.3.1 Scraping a single PDF
10.3.2 Making a function
10.3.3 Looping through every PDF
11 Scraping Tables from PDFs
11.1 Scraping the first table
11.2 Making a function
12 Geocoding
12.1 Geocoding a single address
12.2 Making a function
12.3 Geocoding officer shooting locations
(PART) Visualize
13 Graphing with ggplot2
13.1 What does the data look like?
13.2 Graphing data
13.3 Time-Series Plots
13.4 Color blindness
14 Hotspot maps
14.1 A simple map
14.2 What really are maps?
14.3 Making a hotspot map
14.3.1 Colors
14.4 Looping through each year
15 Choropleth maps
15.1 Spatial joins
15.2 Making choropleth maps
16 Interactive maps
While maps of data are useful, their ability to show incident-level information is quite limited. They tend to show broad trends - where crime happened in a city - rather than provide information about specific crime incidents. While broad trends are important, a static map has a significant drawback: you cannot get details about an individual incident without going back to the data. An interactive map bridges this gap by showing trends while allowing you to zoom in to individual incidents and see information about each one.
For this lesson we will continue to use the officer shooting data so let’s load that.
16.1 Why do interactive graphs matter?
16.1.1 Understanding your data
The most important thing to learn from this course is that understanding your data is crucial to good research. Making interactive maps is a very useful way to better understand your data as you can immediately see geographic patterns and quickly look at characteristics of those incidents to understand them.
In this lesson we will make a map of each officer-involved shooting that lets you click on the shooting and see some information about it. If we see a cluster of shootings, we can click on each shooting to see if they are similar. Though it is possible to find these patterns just looking at the data, it is easier to be able to see a geographic pattern and immediately look at information about each incident.
16.1.2 Police departments use them
Interactive maps are popular in large police departments such as Philadelphia and New York City. They allow easy understanding of geographic patterns in the data and, importantly, allow such access to people who do not have the technical skills necessary to create the maps. If nothing else, learning interactive maps will help you with a future job.
16.2 Making the interactive map
As usual, let’s take a look at the top 6 rows of the data.
head(officer_shootings_geocoded)
#> shooting_number location dates
#> 1 19-04 4900 Hazel Avenue, Philadelphia, PA 2019-03-06
#> 2 19-06 1300 Kater Street, Philadelphia, PA 2019-03-28
#> 4 19 11 2100 Taney Terrace, Philadelphia, PA 2019-04-25
#> 5 19-13 1800 N. Broad Street, Philadelphia, PA 2019-05-11
#> 6 19 14 3400 G Street, Philadelphia, PA 2019-05-20
#> 7 18-01 2800 Kensington Avenue, Philadelphia, PA 2018-01-13
#> lon lat
#> 1 -75.22087 39.95046
#> 2 -75.16355 39.94289
#> 4 -75.19104 39.92646
#> 5 -75.15754 39.98030
#> 6 -75.11482 39.99991
#> 7 -75.12253 39.99151
This data is fairly sparse on information about each shooting. All it has is the date, shooting number, and address (which isn’t that useful as location is already covered by the map). The level of detail about each incident may be sparse, but we can still create a map where you can click an incident dot on the map and a popup will tell you when it happened.
We will use the package leaflet for our interactive map. leaflet produces maps similar to Google Maps with circles (or any icon we choose) for each value we add to the map. It allows you to zoom in, scroll around, and provides context to each incident that isn’t available on a static map.
To make a leaflet map we need to run the function leaflet() and add a tile to the map. A tile is simply the background of the map. This website provides a large number of potential tiles to use, though many are not relevant to our purposes of crime mapping.
We will use a standard tile from Open Street Maps. This tile gives street names and highlights important features such as parks and large stores, which provides useful context for looking at the data. The attribution parameter isn’t strictly necessary but it is good form to say where your tile is from.
library(leaflet)
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors')
When you run the above code it shows a world map (copied several times). Zoom into it and it’ll start showing relevant features of wherever you’re looking.
Note the %>% between the leaflet() function and the addTiles() function. This is called a “pipe” in R and is used like the + in ggplot() to combine multiple functions together. This is used heavily in what is called the “tidyverse”, a series of packages that are prominent in modern R and useful for data analysis. We won’t be covering them in this book but for more information on them you can check the tidyverse website. For this lesson you need to know that each piece of the leaflet function must end with %>% for the next line to work.
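As a small illustration of what the pipe does, the sketch below runs the same calculation twice, once with nested function calls and once with pipes. It assumes the magrittr package (where %>% comes from) is installed; the object names values, nested, and piped are made up for this example and are not part of the lesson’s data.

```r
# A minimal sketch of the pipe, assuming the magrittr package is installed.
# The names `values`, `nested`, and `piped` are hypothetical.
library(magrittr)

values <- c(1, 4, 9)

# Nested calls: read from the inside out.
nested <- sum(sqrt(values))

# Piped calls: read from top to bottom. Each line ends with %>% so that
# its result is passed as the first argument of the next function.
piped <- values %>%
  sqrt() %>%
  sum()

identical(nested, piped)
```

Both versions take the square root of each value and then add the results up, so they give the same answer. The piped version is just easier to read once several steps are chained together, which is why leaflet code is usually written this way.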
To add the points to the graph we use the function addMarkers() which has two parameters, lng and lat. For both parameters we put the column in which the longitude and latitude are, respectively.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addMarkers(lng = officer_shootings_geocoded$lon,
             lat = officer_shootings_geocoded$lat)
It now adds an icon indicating where every shooting in our data is. You can zoom in and scroll around to see more about where the shootings happen. These icons are a bit large, covering nearly all of the city and making it hard to see where shootings happen. To change the icons to circles we can change the function addMarkers() to addCircleMarkers(), keeping the rest of the code the same.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addCircleMarkers(lng = officer_shootings_geocoded$lon,
                   lat = officer_shootings_geocoded$lat)
This turns the icons into circles but they are still large and cover most of the map. To adjust the size of our icons we use the radius parameter in addMarkers() or addCircleMarkers(). The larger the radius, the larger the icons.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addCircleMarkers(lng = officer_shootings_geocoded$lon,
                   lat = officer_shootings_geocoded$lat,
                   radius = 5)
Setting the radius option to 5 shrinks the size of the icon a lot. In your own maps you’ll have to fiddle with this option to get it to look the way you want. Let’s move on to adding information about each icon when clicked upon.
16.3 Adding popup information
The parameter popup in the addMarkers() or addCircleMarkers() functions lets you input a character value (if not already a character value it will convert it to one) and that will be shown as a popup when you click on the icon. Let’s start simple here by inputting the dates column in our data and then build it up to a more complicated popup.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addCircleMarkers(lng = officer_shootings_geocoded$lon,
                   lat = officer_shootings_geocoded$lat,
                   radius = 5,
                   popup = officer_shootings_geocoded$dates)
Try clicking around and you’ll see that the date of the incident you clicked on appears over the dot. Though fairly clear in this case, we usually want to have a title indicating what the value in the popup means. We can do this by using the paste() function to combine text explaining the value with the value itself. Let’s add the words “Date of Shooting:” before the date.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addCircleMarkers(lng = officer_shootings_geocoded$lon,
                   lat = officer_shootings_geocoded$lat,
                   radius = 5,
                   popup = paste("Date of Shooting:", officer_shootings_geocoded$dates))
We don’t have many other columns but we can add the location and shooting number to the popup by adding them to the paste() function we’re using.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addCircleMarkers(lng = officer_shootings_geocoded$lon,
                   lat = officer_shootings_geocoded$lat,
                   radius = 5,
                   popup = paste("Shooting Number:", officer_shootings_geocoded$shooting_number,
                                 "Date:", officer_shootings_geocoded$dates,
                                 "Location:", officer_shootings_geocoded$location))
Just adding the location text makes it try to print out everything on one line which is hard to read. If we add the text <br> where we want a line break it will make one. <br> is the HTML tag for a line break, which is why it creates a new line in the popup.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addCircleMarkers(lng = officer_shootings_geocoded$lon,
                   lat = officer_shootings_geocoded$lat,
                   radius = 5,
                   popup = paste("Shooting Number:", officer_shootings_geocoded$shooting_number,
                                 "<br>",
                                 "Date:", officer_shootings_geocoded$dates,
                                 "<br>",
                                 "Location:", officer_shootings_geocoded$location))
16.4 Dealing with too many markers
Even though we shrank the size of the circles, it is still rather hard to see any trends as there are so many incidents and relatively large circles. One solution is to keep shrinking the size of the circles, but this quickly becomes a bad solution when using more frequent data such as a crime data set (Philadelphia data alone has about 200k crimes reported per year). The other solution is to cluster the data into groups where the individual dots only show once you zoom in.
If we add the code clusterOptions = markerClusterOptions() to our addCircleMarkers() it will cluster for us.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addCircleMarkers(lng = officer_shootings_geocoded$lon,
                   lat = officer_shootings_geocoded$lat,
                   radius = 5,
                   popup = paste("Shooting Number:", officer_shootings_geocoded$shooting_number,
                                 "<br>",
                                 "Date:", officer_shootings_geocoded$dates,
                                 "<br>",
                                 "Location:", officer_shootings_geocoded$location),
                   clusterOptions = markerClusterOptions())
Incidents close to each other are grouped together in fairly arbitrary groupings and we can see how large each grouping is by moving our cursor over the circle. Click on a circle or zoom in and it will show smaller groupings at lower levels of aggregation. Keep clicking or zooming in and it will eventually show each incident as its own circle.
This method is very useful for dealing with huge amounts of data as it avoids overflowing the map with too many icons at one time. A downside, however, is that the clusters are created arbitrarily meaning that important context, such as neighborhood, can be lost.
16.5 Interactive choropleth maps
In Chapter 15 (Choropleth maps) we worked on choropleth maps, which are maps with shaded regions, such as states colored by which political party won them in an election. Here we will make interactive choropleth maps where you can click on a shaded region and see information about that region. We’ll make the same map as before - Census tracts with the number of officer-involved shootings.
Let’s load the tract-level officer-involved shooting data we made earlier.
We’ll begin the leaflet map similar to before but use the function addPolygons() and our input here is the geometry column of philly_tracts_shootings.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addPolygons(data = philly_tracts_shootings$geometry)
#> Warning: sf layer is not long-lat data
#> Warning: sf layer has inconsistent datum (+proj=lcc +lat_1=40.96666666666667 +lat_2=39.93333333333333 +lat_0=39.33333333333334 +lon_0=-77.75 +x_0=600000 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs).
#> Need '+proj=longlat +datum=WGS84'
It gives us a blank map because our polygons are projected to Philly’s projection while the leaflet map expects the standard CRS, WGS84, which uses longitude and latitude. So we need to change our projection to that using the st_transform() function from the sf package.
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
philly_tracts_shootings <- st_transform(philly_tracts_shootings,
                                        crs = "+proj=longlat +datum=WGS84")
Now let’s try again.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addPolygons(data = philly_tracts_shootings$geometry)
It made a map with large blue lines indicating each tract. Let’s change the appearance of the graph a bit before making a popup or shading the tracts. The parameter color in addPolygons() changes the color of the lines - let’s change it to black. The lines are also very large, blurring into each other and making the tracts hard to see. We can change the weight parameter to alter the size of these lines - smaller values are smaller lines. Let’s try setting this to 1.
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addPolygons(data = philly_tracts_shootings$geometry,
              color = "black",
              weight = 1)
That looks better and we can clearly distinguish each tract now.
As we did earlier, we can add the popup text directly to the function which makes the geographic shapes, in this case addPolygons(). Let’s add the GEOID10 column value - the unique ID code for that tract - and the number of shootings that occurred in that tract. As before when we click on a tract a popup appears with the output we specified.
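The popup code below uses paste0() where the earlier maps used paste(). The difference is only in spacing, and a small sketch may make it clearer. The values tract_id and n_shootings here are made up for illustration and do not come from the data.

```r
# A minimal sketch comparing paste() and paste0() for popup text.
# `tract_id` and `n_shootings` are hypothetical example values.
tract_id <- "42101000100"
n_shootings <- 2

# paste() puts a space between every piece it combines,
# so the <br> tag ends up with a space on each side.
with_paste <- paste("Tract ID:", tract_id,
                    "<br>",
                    "Number of Shootings:", n_shootings)

# paste0() adds no spaces, so any space we want
# must be typed inside the quotes.
with_paste0 <- paste0("Tract ID: ", tract_id,
                      "<br>",
                      "Number of Shootings: ", n_shootings)

with_paste
with_paste0
```

Both versions look the same in the popup, since a web browser ignores the extra spaces around <br>. Using paste0() simply gives you exact control over where the spaces go.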
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addPolygons(data = philly_tracts_shootings$geometry,
              color = "black",
              weight = 1,
              popup = paste0("Tract ID: ", philly_tracts_shootings$GEOID10,
                             "<br>",
                             "Number of Shootings: ", philly_tracts_shootings$number_shootings))
For these types of maps we generally want to shade each polygon to indicate how frequently the event occurs in the polygon. For this process we will make a simple function which will automatically shade the tracts by the value in the column we want it shaded by - number_shootings.
We’ll use the function colorNumeric() to make our colors, which takes a lot of the work out of this process. This function takes two inputs, first a color palette which we can get from the site colorbrewer2. Let’s use the fourth bar in the Sequential page, which is light orange to red. If you look in the section with each HEX value it says that the palette is “3-class OrRd”. The “3-class” just means we selected 3 colors, the “OrRd” is the part we want. That will tell colorNumeric() to make the palette using these colors. The second parameter is the column for our numeric variable, number_shootings.
We will save the output of colorNumeric("OrRd", philly_tracts_shootings$number_shootings) as a new variable which we’ll call pal for convenience. Then inside of addPolygons() we’ll set the parameter fillColor to pal(philly_tracts_shootings$number_shootings), running this function on the column. What this really does is determine which color every tract should be based on the value in the number_shootings column.
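To see what pal actually is before putting it on a map, here is a small sketch that runs colorNumeric() on a made-up vector of counts rather than on the real tract data. The vector counts and the object tract_colors are hypothetical names used only for this illustration; the sketch assumes the leaflet package is installed.

```r
# A minimal sketch of what colorNumeric() returns, assuming leaflet is installed.
# `counts` and `tract_colors` are hypothetical; the numbers are made up.
library(leaflet)

counts <- c(0, 3, 10)

# colorNumeric() does not return colors. It returns a function that
# turns numbers into colors from the chosen palette.
pal <- colorNumeric("OrRd", counts)
is.function(pal)

# Running that function on the numbers gives one color per value,
# written as a HEX code that leaflet can use for fillColor.
tract_colors <- pal(counts)
tract_colors
```

Each number gets its own color, with low values toward the light orange end of the palette and high values toward the red end. This is why the same pal object can be reused later for both the tract shading and the legend.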
pal <- colorNumeric("OrRd", philly_tracts_shootings$number_shootings)
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addPolygons(data = philly_tracts_shootings$geometry,
              color = "black",
              weight = 1,
              popup = paste0("Tract ID: ", philly_tracts_shootings$GEOID10,
                             "<br>",
                             "Number of Shootings: ", philly_tracts_shootings$number_shootings),
              fillColor = pal(philly_tracts_shootings$number_shootings))
Since the tracts are transparent, it is hard to distinguish which color is shown. We can make each tract a solid color by setting the parameter fillOpacity inside of addPolygons() to 1.
pal <- colorNumeric("OrRd", philly_tracts_shootings$number_shootings)
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addPolygons(data = philly_tracts_shootings$geometry,
              color = "black",
              weight = 1,
              popup = paste0("Tract ID: ", philly_tracts_shootings$GEOID10,
                             "<br>",
                             "Number of Shootings: ", philly_tracts_shootings$number_shootings),
              fillColor = pal(philly_tracts_shootings$number_shootings),
              fillOpacity = 1)
To add a legend to this we use the function addLegend() which takes three parameters. pal asks which color palette we are using - we want it to be the exact same as we use to color the tracts so we’ll use the pal object we made. The values parameter tells it which column our numeric values are from, in our case the number_shootings column, so we’ll input that. Finally, opacity determines how transparent the legend will be. As each tract is set to not be transparent at all, we’ll also set this to 1.
pal <- colorNumeric("OrRd", philly_tracts_shootings$number_shootings)
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addPolygons(data = philly_tracts_shootings$geometry,
              color = "black",
              weight = 1,
              popup = paste0("Tract ID: ", philly_tracts_shootings$GEOID10,
                             "<br>",
                             "Number of Shootings: ", philly_tracts_shootings$number_shootings),
              fillColor = pal(philly_tracts_shootings$number_shootings),
              fillOpacity = 1) %>%
  addLegend(pal = pal,
            values = philly_tracts_shootings$number_shootings,
            opacity = 1)
Finally, we can add a title to the legend using the title parameter inside of addLegend().
pal <- colorNumeric("OrRd", philly_tracts_shootings$number_shootings)
leaflet() %>%
  addTiles('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png',
           attribution = '© <a href="http://openstreetmap.org">
           OpenStreetMap</a> contributors') %>%
  addPolygons(data = philly_tracts_shootings$geometry,
              color = "black",
              weight = 1,
              popup = paste0("Tract ID: ", philly_tracts_shootings$GEOID10,
                             "<br>",
                             "Number of Shootings: ", philly_tracts_shootings$number_shootings),
              fillColor = pal(philly_tracts_shootings$number_shootings),
              fillOpacity = 1) %>%
  addLegend(pal = pal,
            values = philly_tracts_shootings$number_shootings,
            opacity = 1,
            title = "Police Shootings")
17 More graphing with ggplot
17.1 Graphing a single variable
17.1.1 Numeric variable
17.1.2 Categorical variable
17.2 Time Series
18 R Markdown
18.1 Code
18.1.1 Hiding code in the output
18.2 Tables
18.3 Making the output file
(PART) Data
19 Introduction
At this point you have learned how to read in data, manipulate it to get just the parts you want or to aggregate it to the level you want, and visualize it through maps or graphs. You’ve done so using data sets that are commonly used in criminological research.
In the next several chapters we will be introducing a number of other data sets - or looking deeper into data we’ve already seen - that are common in criminology. While these chapters do use R a bit to explore or read in the data, they are primarily a discussion of the trade-offs of using each data set. Some of the data sets are difficult to read into R, requiring more steps than you may be used to, so these chapters will also discuss how to get that data into R.
20 Uniform Crime Report (UCR) Data - Offenses Known and Clearances by Arrest
20.1 Exploring the UCR data
20.2 ORIs - Unique agency identifiers
20.3 Hierarchy Rule
20.4 Which crimes are included
20.4.1 Index Crimes
20.4.2 The problem with using index crimes
20.4.3 Rape definition change
20.5 Actual offenses, clearances, and unfounded offenses
20.5.1 Actual
20.5.2 Total Cleared
20.5.3 Cleared Where All Offenders Are Under 18
20.5.4 Unfounded
20.6 Number of months reported
22 Census data from IPUMS
22.1 Getting IPUMS data
22.2 Cleaning the data
22.3 Aggregating the data
22.4 Graphing the data
22.5 Mapping the data
23 National Incident-Based Reporting System (NIBRS) Data
23.1 Downloading the data
23.2 Reading the data
(APPENDIX) Appendix
24 Useful resources
24.0.1 Learning R and coding issues
R for Data Science - This free online book provides a good introduction for R though it differs in several important ways from this class.
Stack Overflow - Stack Overflow is a website that answers programming-related questions. It’s like the Yahoo Answers of programming. That said, a lot of the answers are bad. Some answers are overly confusing or provide code that you may not understand. You can use this source, but don’t rely too heavily on it. Its search function isn’t great so it’s better to Google your question and choose the stackoverflow.com result.